Introduction to Open Data Science - Course Project

Chapter 1: About the project

date()
## [1] "Thu Dec 09 16:17:32 2021"

Well, I had never coded in my life before last January, so I am still a bit of a “noob” at these types of tasks. Regardless, I am a fast learner. I have been able to design some robust pipelines for data analysis, but I have maintained them locally. It is very exciting for me to start learning to use GitHub and all of its applications. The description of the course got me really excited, since I am always copy/pasting code and referencing the source, and I think the abilities shown in the course will enable me to be more agile in my coding.
Here is my forked repository:
https://github.com/GonzalezVictorM/IODS-project


Chapter 2: Data wrangling and analysis

date()
## [1] "Thu Dec 09 16:17:32 2021"

Here we go again…

Remember! Set your working directory and load your libraries.

knitr::opts_knit$set(root.dir ="C:/Users/ramosgon/OneDrive - University of Helsinki/Courses/OpenDataScience/IODS-project/Data")

library(tidyverse)
library(ggplot2)
library(GGally)
library(hrbrthemes)
library(janitor)

The data we will be working on comes from an international survey of Approaches to Learning, made possible by Teachers’ Academy funding for KV in 2013-2015.

The data describes the subjects’ learning approaches (attitude, surface, deep, and strategic learning) together with their gender, age, and the points they obtained. The data has been pre-processed for analysis by consolidating 60 variables into the 7 of importance. In addition, 166 of the original 183 observations were kept as valuable by filtering for Points > 0.
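The consolidation described above can be sketched roughly as follows. This is only an illustration, not the original wrangling code: `raw_survey` and the item names in `deep_items` are placeholders, since the actual survey columns are not shown here.

```r
library(dplyr)

# Hypothetical sketch of the pre-processing described above.
# `deep_items` stands in for the survey questions behind the deep score;
# the same pattern would apply to attitude, stra, and surf.
deep_items <- c("D03", "D06", "D07", "D11")            # assumed item names
learning2014 <- raw_survey %>%                         # raw_survey: the 183 x 60 data
  mutate(deep = rowMeans(across(all_of(deep_items)))) %>%
  filter(Points > 0)                                   # keeps 166 of 183 observations
```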

First, we read the data and explore the dimensions, structure and head.

learning2014 <- read.table("learning2014.csv", header = T, sep = ",")

dim(learning2014)
## [1] 166   7
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21

As we can observe, the data contains 166 observations of 7 variables, as expected. The columns included are gender, age, attitude, deep, strategic, surface, and points. Looking at the head, we confirm that the organization of the table matches the dimensions and structure.

Once we understand the structure of our data, we can move to the analysis. The goal is to understand the effect that the different learning variables have over the points in the examination. This can be aided by observing if gender or age have an effect over the results.

Our first step is to do Exploratory Data Analysis using scatter plots, box plots, and histograms as well as descriptive statistics such as correlation coefficients, mean and variance. Thankfully, ggpairs consolidates most of that information in a single figure with the following code.

ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

But, what should we focus on?

We can start by observing the age and gender distributions that are described in the first and second columns. From there, we can conclude that:

  • There are almost twice as many female as male subjects.
  • At first sight, neither gender nor age seems to have a large effect on the learning types.
  • The age distribution is similar for both genders.
  • There is a strong negative correlation between surface learning and both attitude and deep learning in male subjects.

Additionally, I personally prefer to look at violin plots instead of box plots, since violin plots also show the density of the data.

# Narrow the data for Learning type scores of Female subjects
female_lt <- learning2014 %>%
  filter(gender == "F") %>%
  .[,c(3:6)] %>%
  gather(key="LearningType", value="Score") %>%
  mutate(gender = "F")

# Narrow the data for Learning type scores of Male subjects
male_lt <- learning2014 %>%
  filter(gender == "M") %>%
  .[,c(3:6)] %>%
  gather(key="LearningType", value="Score") %>%
  mutate(gender = "M")

# Bind female and male data
both_lt <- rbind(female_lt,male_lt)

# Print the data
both_lt %>%
  mutate(LearningStyle = paste0(LearningType, "\n", gender)) %>%
  ggplot(aes(x = LearningStyle, y = Score, fill = gender)) +
    geom_violin(width = 1, alpha = 0.5) +
    geom_boxplot(width = 0.1, alpha = 0.2)

This plot helps us confirm that the densities are fairly even across genders and learning types, supporting the normality of each variable. Still, we can see outliers in the lower tails of attitude (M), deep (F), and deep (M). We will keep this in mind when we move to the diagnostic plots.
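A formal test can back up the visual normality check. The sketch below applies the Shapiro-Wilk test to the deep scores per gender (the same idea works for the other learning types); a small p-value would argue against normality.

```r
library(dplyr)

# Shapiro-Wilk test of normality for the deep learning scores, per gender
learning2014 %>%
  group_by(gender) %>%
  summarise(shapiro_p = shapiro.test(deep)$p.value)
```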

Lastly, we can look at the mean and standard deviation of the scores separated by gender in the form of a table. In the following table, you can notice the similarities between the genders.

learning2014 %>%
  group_by(gender) %>%
  summarise(n=n(),
            attitudeMean = mean(attitude),
            attitudeSD = sd(attitude),
            deepMean = mean(deep),
            deepSD = sd(deep),
            StraMean = mean(stra),
            StraSD = sd(stra),
            surfMean = mean(surf),
            surfSD = sd(surf),
            pointsMean = mean(points),
            pointsSD = sd(points)) %>% # The summarise function gives you the statistics that you want.
  t %>% as.data.frame %>%
  row_to_names(row_number = 1)
##                      F         M
## n                  110        56
## attitudeMean  2.990000  3.442857
## attitudeSD   0.7251637 0.6463484
## deepMean      3.656818  3.724702
## deepSD       0.5349331 0.5924439
## StraMean      3.201136  2.964286
## StraSD       0.7477212 0.8008214
## surfMean      2.829545  2.703869
## surfSD       0.4580220 0.6423441
## pointsMean    22.32727  23.48214
## pointsSD      5.826402  6.006031

The box plot, the violin plot and the table have the same goal: to provide descriptive statistics that help us identify unusual features in our data, such as outliers or a non-normal distribution of the results. The latter is particularly important since some modelling strategies assume normality in the data.

Once we have observed the effects of gender and age, we can continue to evaluate the correlation between the learning types and the examination points. This can be done from the figure above or by plotting each scatter plot with its linear model separately. I always like to double-click into each correlation, so here is the code for that.

# Attitude vs points
cor <- signif(cor(learning2014$attitude,learning2014$points),2) #calculate the correlation between the two variables
ggplot(learning2014, aes(x = attitude, y = points, col = gender)) + # Plot the variables stratified by gender
  geom_point() +
  geom_smooth(method ="lm", formula = 'y ~ x') + # Include the linear model
  ggtitle(paste("Comparison of attitude and points by gender. Correlation = ", cor)) # Create a title including the correlation coefficient

# Deep vs points
cor <- signif(cor(learning2014$deep,learning2014$points),2)
ggplot(learning2014, aes(x = deep, y = points, col = gender)) +
  geom_point() +
  geom_smooth(method ="lm", formula = 'y ~ x') +
  ggtitle(paste("Comparison of deep and points by gender. Correlation = ", cor))

# Strategic vs points
cor <- signif(cor(learning2014$stra,learning2014$points),2)
ggplot(learning2014, aes(x = stra, y = points, col = gender)) +
  geom_point() +
  geom_smooth(method ="lm", formula = 'y ~ x') +
  ggtitle(paste("Comparison of strategic and points by gender. Correlation = ", cor))

# Surface vs points
cor <- signif(cor(learning2014$surf,learning2014$points),2)
ggplot(learning2014, aes(x = surf, y = points, col = gender)) +
  geom_point() +
  geom_smooth(method ="lm", formula = 'y ~ x') +
  ggtitle(paste("Comparison of surf and points by gender. Correlation = ", cor))

From these plots we can conclude:

  • There is no significant correlation between points and deep learning for either male or female subjects.
  • Points correlate more strongly with attitude than with any other learning type, for both male and female subjects.
  • Even though their correlations with points are not as strong as that of attitude, strategic and surface learning should be evaluated by multiple linear regression analysis before being excluded.

We start by including the three variables of interest in the multiple linear model and looking at the coefficient table presented by the summary() function.

my_model <- lm(points ~ attitude + stra + surf, data = learning2014) # Create the model

summary(my_model) # Print the summary of results
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The first thing that jumps out is the t statistic and its associated p-values. Both the intercept and the attitude coefficient are significant (<0.05), while the rest remain non-significant. The p-value of the t statistic tells us the probability that the variation in the target variable attributed to the explanatory variable is happening by chance (due to standard error). When this probability is under our significance level, usually 0.05, we consider that the coefficient of our explanatory variable significantly explains the variation in the target variable. Therefore, we tentatively conclude that at least attitude has a significant relationship with points.

Then, we observe the F-statistic and its associated p-value. The F-statistic tests the regression as a whole, which avoids the problems of p-values in multiple testing (which I think will be covered later in the course). Therefore, we also expect it to have a significant (<0.05) value, which it does! This means that at least one of the explanatory variables is causing this model to explain the variation in the target variable better than the Null distribution (all coefficients equal to zero).
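The F-statistic p-value reported in the summary can be reproduced directly from the F distribution. The sketch below uses the rounded statistic shown above, so the result only approximates the reported 3.156e-08.

```r
# Upper-tail probability of the F distribution with 3 and 162 degrees
# of freedom, evaluated at the (rounded) F statistic from the summary
pf(14.13, df1 = 3, df2 = 162, lower.tail = FALSE)
```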

Lastly, we look at the multiple R-squared. This tells us how much of the variation in our target variable is explained by the model instead of the error. We can see that 20.7% of the variation is explained by our explanatory variables. This is low, but we still have work to do.

When working with multiple linear regressions, you want to remove/add one explanatory variable at a time. Here we will remove the explanatory variable whose t statistic has the highest p-value, since it is the one that least explains the variation in our target variable. The chosen one is … surface.
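As a side note, refitting without a single term can also be written with update(), which modifies the existing model's formula instead of spelling it out again; a small sketch:

```r
# Equivalent to fitting points ~ attitude + stra from scratch:
my_model2 <- update(my_model, . ~ . - surf)
```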

We will now run the model again.

my_model2 <- lm(points ~ attitude + stra, data = learning2014)

summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

We can first see that all the p-values associated with the t statistics decreased radically. Still, strategic remains above the typical threshold of 0.05. This means there is more than a 5% probability that the variation explained by that coefficient could be happening by chance. We might have to remove it, but let us look at the other statistical descriptors first.

The F-statistic increased and its p-value decreased. This tells us that this model is better at explaining the variation of our target variable than the last one.

Lastly, the multiple R-squared actually decreased slightly (from 20.7% to 20.5%), but the adjusted R-squared, which penalizes additional explanatory variables, increased; per variable used, this model explains the variation in the target variable better.

Even with the improved results, I want to test what happens if we remove the explanatory variable with the highest remaining t-statistic p-value, to evaluate whether the model works better without it.

my_model3 <- lm(points ~ attitude, data = learning2014)

summary(my_model3)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Neither the p-value associated with the t statistic nor the one associated with the F statistic changed significantly (less than one order of magnitude). Additionally, the multiple R-squared got worse. Because of this, I will keep the second model as our best chance of explaining the data.

In our selected model, my_model2 (points ~ attitude + strategic), we have a coefficient of 3.4658 for attitude and 0.9137 for strategic. This means that every 1 unit increase in the attitude variable corresponds to a 3.5 unit increase in points, while every 1 unit increase in the strategic variable corresponds to a 0.9 unit increase in points.
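This interpretation can be checked with predict(). The two hypothetical students below differ only by one unit of attitude, so their predicted points differ by exactly the attitude coefficient.

```r
# Two hypothetical students, identical except for a one-unit attitude difference
new_students <- data.frame(attitude = c(3, 4), stra = c(3, 3))
preds <- predict(my_model2, newdata = new_students)
diff(preds)   # equals the attitude coefficient, ~3.47
```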

To study our selected model, we will generate diagnostic plots. Here we use plots that visually represent the relationships of the residuals (the differences between observed and fitted values) in the model. This will help us validate the assumptions made by the model.

par(mfrow = c(2,2))
plot(my_model2, which = c(1,2,5))

On the first plot we have the Residuals vs the Fitted values. This plot relates the residuals (errors) to the fitted values of the target variable. As we can observe in the plot, the variability of the residuals appears to remain constant regardless of the size of the fitted values, and there is no clear pattern. This supports the assumption that the size of the errors does not depend on the explanatory variables.

The second plot is the normal Q-Q plot. This plot aids in checking the symmetry and normality of the error term in the regression. We see that most of the values fall on the identity line, but both the lower and upper tails of the data are skewed. This makes us doubt the normality assumption of the errors in the model.

Lastly, we have the residuals vs leverage plot. The leverage plot helps us observe the influence of specific points in the data. We can see that there is no clear point with an outsized impact on the model.
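As a numerical complement to the residuals vs leverage panel, Cook's distances quantify each observation's influence on the fitted model; a sketch:

```r
# Cook's distance for every observation in the selected model;
# values well above the rest flag influential points
cooks <- cooks.distance(my_model2)
head(sort(cooks, decreasing = TRUE), 3)   # the three most influential observations
```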


Chapter 3: Logistic Regression

date()
## [1] "Thu Dec 09 16:17:44 2021"

Here we go again…

Remember! Set your working directory and load your libraries.

knitr::opts_knit$set(root.dir ="C:/Users/ramosgon/OneDrive - University of Helsinki/Courses/OpenDataScience/IODS-project/Data")

library(tidyverse)
library(ggplot2)
library(GGally)
library(ggpubr)
library(boot)

We read the data.

pormath <- read.csv("pormath.csv", header = T, sep = ",")

The data comes from P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. The data explores the attributes of students’ lives and their effect on their academic performance in math and Portuguese classes.

For this analysis, we will be focusing on the alcohol use of the students and the effect that various attributes might have on it.

We first explore the dimensions and structure.

dim(pormath)
## [1] 370  50
str(pormath)
## 'data.frame':    370 obs. of  50 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ id.p      : int  1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
##  $ id.m      : int  2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 ...
##  $ failures.p: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ failures.m: int  0 0 3 0 0 0 0 0 0 0 ...
##  $ paid.p    : chr  "no" "no" "no" "no" ...
##  $ paid.m    : chr  "no" "no" "yes" "yes" ...
##  $ absences.p: int  4 2 6 0 0 6 0 2 0 0 ...
##  $ absences.m: int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1.p      : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G1.m      : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2.p      : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G2.m      : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3.p      : int  11 11 12 14 13 13 13 13 17 13 ...
##  $ G3.m      : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ failures  : num  0 0 1.5 0 0 0 0 0 0 0 ...
##  $ paid      : chr  "no" "no" "no" "no" ...
##  $ absences  : num  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : num  2.5 7 9.5 14.5 8.5 13.5 12.5 8 15.5 13 ...
##  $ G2        : num  8.5 8 10.5 14 11.5 13.5 12 9 17 13.5 ...
##  $ G3        : num  8.5 8.5 11 14.5 11.5 14 12 9.5 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ cid       : int  3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...

Since we are exploring alcohol use, we will select the variables high_use and alc_use. The variable high_use marks students whose alcohol consumption frequency over the whole week is above 2 (numeric: from 1 - very low to 5 - very high), while alc_use records the actual value (numeric: from 1 - very low to 5 - very high). Once we have our target variables, we select the explanatory variables that we wish to study. I am interested in finding out whether age, health, or the family situation affects alcohol consumption. Therefore, I will select age (numeric: from 15 to 22), health (numeric: from 1 - very bad to 5 - very good), famrel (numeric: from 1 - very bad to 5 - excellent), and famsup (binary: yes or no). Additionally, it is always interesting to take a look at the influence of gender, so I will also include sex (binary: ‘F’ - female or ‘M’ - male).
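For reference, the two target variables were pre-computed during data wrangling. Judging from the Dalc and Walc columns in the structure above (e.g. row 3: Dalc = 2, Walc = 3, alc_use = 2.5), they appear to follow this pattern; this is a sketch of the likely derivation, not the original wrangling code.

```r
library(dplyr)

# alc_use averages weekday (Dalc) and weekend (Walc) consumption;
# high_use flags averages above 2
pormath <- pormath %>%
  mutate(alc_use  = (Dalc + Walc) / 2,
         high_use = alc_use > 2)
```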

My hypotheses are:

  • Due to the young age of the students, health will not show a significant relationship with high_use.
  • Both famrel and famsup will have an effect on alcohol consumption. I expect that the better the family situation, the lower the consumption.
  • Age will have a strong effect on high_use, simply due to purchasing power and opportunity.

The three formats of the data shown below will allow us to create different types of graphs, as will be seen in the rest of the report.

## Select the columns as they are
pormath_subset <- pormath %>% 
  select(sex, high_use, alc_use, age, health, famrel, famsup)

## Turn sex, health, famrel, and famsup into factors
pormath_subset_fct <- pormath %>% 
  select(sex, high_use, alc_use, age, health, famrel, famsup) %>%
  mutate(across(c(sex, health, famrel, famsup), factor))

## Turn health, famrel, and famsup into numeric values
pormath_subset_nmr <- pormath_subset_fct %>%
  mutate(across(c(health, famrel, famsup), as.numeric))

As a first look, we can explore our variables numerically. We can do this both as numeric variables:

pormath_subset_nmr %>%
  summary
##  sex      high_use          alc_use           age            health     
##  F:195   Mode :logical   Min.   :1.000   Min.   :15.00   Min.   :1.000  
##  M:175   FALSE:259       1st Qu.:1.000   1st Qu.:16.00   1st Qu.:3.000  
##          TRUE :111       Median :1.500   Median :17.00   Median :4.000  
##                          Mean   :1.889   Mean   :16.58   Mean   :3.562  
##                          3rd Qu.:2.500   3rd Qu.:17.00   3rd Qu.:5.000  
##                          Max.   :5.000   Max.   :22.00   Max.   :5.000  
##      famrel          famsup     
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:1.000  
##  Median :4.000   Median :2.000  
##  Mean   :3.935   Mean   :1.624  
##  3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :5.000   Max.   :2.000

or as factors and their frequency:

pormath_subset_fct %>%
  summary
##  sex      high_use          alc_use           age        health  famrel 
##  F:195   Mode :logical   Min.   :1.000   Min.   :15.00   1: 46   1:  8  
##  M:175   FALSE:259       1st Qu.:1.000   1st Qu.:16.00   2: 42   2: 18  
##          TRUE :111       Median :1.500   Median :17.00   3: 80   3: 64  
##                          Mean   :1.889   Mean   :16.58   4: 62   4:180  
##                          3rd Qu.:2.500   3rd Qu.:17.00   5:140   5:100  
##                          Max.   :5.000   Max.   :22.00                  
##  famsup   
##  no :139  
##  yes:231  
##           
##           
##           
## 

We can observe the following:

  • The numbers of male and female students are quite similar.
  • More than twice as many students are not high users of alcohol as are.
  • 75% of the students rate their family relationship 4 or 5.
  • Almost twice as many students have family support as do not.

Now we look at the graphical representation separated by high_use:

pormath_subset %>%
  select(-sex) %>% # remove sex
  ggpairs(mapping = aes(col = high_use, alpha = 0.3), 
          lower = list(combo = wrap("facethist", bins = 20)))

and sex:

pormath_subset %>%
  select(-high_use) %>% # remove high_use
  ggpairs(mapping = aes(col = sex, alpha = 0.3), 
          lower = list(combo = wrap("facethist", bins = 20)))

Immediately we can notice that:

  • alc_use has a strong correlation with age and famrel, but mostly in male students.
  • health has a strong correlation with famrel, but only in male students.

We can now look at the means and std deviations of the data by cross-tabulation:

pormath_subset_nmr %>%
  group_by(high_use, sex) %>%
  summarise(
    AvgAge = mean(age),
    sdAge = sd(age),
    AvgHealth = mean(health),
    sdHealth= sd(health),
    AvgFamRel = mean(famrel),
    sdFamRel = sd(famrel),
    AvgFamSup = mean(famsup),
    sdFamSup = sd(famsup))
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
## # A tibble: 4 x 10
## # Groups:   high_use [2]
##   high_use sex   AvgAge sdAge AvgHealth sdHealth AvgFamRel sdFamRel AvgFamSup
##   <lgl>    <fct>  <dbl> <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
## 1 FALSE    F       16.6  1.09      3.37     1.40      3.92    0.926      1.69
## 2 FALSE    M       16.3  1.24      3.67     1.41      4.13    0.833      1.55
## 3 TRUE     F       16.5  1.12      3.39     1.55      3.73    0.837      1.73
## 4 TRUE     M       16.9  1.23      3.93     1.28      3.79    0.991      1.51
## # ... with 1 more variable: sdFamSup <dbl>

But this is hard to read, so let us show it graphically. The following graph shows the average value of each variable with its standard error bars.

pormath_subset_nmr %>%
  pivot_longer(cols = alc_use:famsup, names_to = "Attribute") %>%
  group_by(high_use, sex, Attribute) %>%
  summarise(Avg = mean(value), 
            se = sd(value)/sqrt(length(value))) %>%
  filter(Attribute != "alc_use") %>%
  ggplot(aes(x = sex, y = Avg, fill = high_use)) +
  geom_bar(position = position_dodge(), 
           stat = "identity", 
           alpha = 0.6) +
  geom_errorbar(aes(x = sex, ymin = Avg-se, ymax = Avg+se),
                position = position_dodge(width = 0.9),
                width = 0.6,
                alpha = 0.6)+
  facet_wrap("Attribute", scales = "free") +
  theme_classic()
## `summarise()` has grouped output by 'high_use', 'sex'. You can override using the `.groups` argument.

As expected, high_use shows differences in means larger than the standard error for age and famrel, but mostly in male students. Health also shows some difference in means, but only in male students and smaller than the standard error.

Now we look at box plots of each variable to confirm the data distribution.

First, age:

g1 <- pormath_subset_nmr %>%
  ggplot(aes(x = high_use, y = age, fill = sex)) +
  geom_boxplot(alpha = 0.4) +
  facet_wrap("sex") +
  theme_classic()

g2 <- pormath_subset_nmr %>%
  ggplot(aes(x = alc_use, y = age, group = alc_use, fill = sex)) +
  geom_boxplot(alpha = 0.4) +
  facet_wrap("sex") +
  theme_classic()

ggarrange(g1, g2, ncol = 1)

Among male students particularly, there seems to be an increase in alc_use as age increases.

Then, health:

g1 <- pormath_subset_nmr %>%
  ggplot(aes(x = high_use, y = health, fill = sex)) +
  geom_boxplot(alpha = 0.4) +
  facet_wrap("sex") +
  theme_classic()

g2 <- pormath_subset_nmr %>%
  ggplot(aes(x = alc_use, y = health, group = alc_use, fill = sex)) +
  geom_boxplot(alpha = 0.4) +
  facet_wrap("sex") +
  theme_classic()

ggarrange(g1, g2, ncol = 1)

No clear pattern is visible.

Now, famrel:

g1 <- pormath_subset_nmr %>%
  ggplot(aes(x = high_use, y = famrel, fill = sex)) +
  geom_boxplot(alpha = 0.4) +
  facet_wrap("sex") +
  theme_classic()

g2 <- pormath_subset_nmr %>%
  ggplot(aes(x = alc_use, y = famrel, group = alc_use, fill = sex)) +
  geom_boxplot(alpha = 0.4) +
  facet_wrap("sex") +
  theme_classic()

ggarrange(g1, g2, ncol = 1)

It looks like students with high_use tend to have lower famrel.

famsup is a bit tricky, since it is a factor variable. Therefore, we will use frequency and mean plots instead of box plots.

g1 <- pormath_subset_fct %>%
  group_by(sex, famsup, high_use) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = famsup, y = count, fill = high_use )) +
    geom_bar(position = "dodge",stat = "identity",alpha = 0.4) +
    facet_wrap("sex") +
    theme_classic()
## `summarise()` has grouped output by 'sex', 'famsup'. You can override using the `.groups` argument.
g2 <- pormath_subset_fct %>%
  group_by(sex, famsup) %>%
  summarise( se = sd(alc_use)/sqrt(length(alc_use)),
             alc_use = mean(alc_use)) %>%
  ggplot(aes(x = famsup, y = alc_use, fill = sex)) +
  geom_bar(position = "dodge",stat = "identity",alpha = 0.4) +
  geom_errorbar(aes(x = famsup, ymin = alc_use-se, ymax = alc_use+se),
                position = position_dodge(width = 0.9),
                width = 0.6,
                alpha = 0.6) +
  facet_wrap("sex") +
  theme_classic()
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
ggarrange(g1, g2, ncol = 1)

alc_use does not seem to change significantly based on famsup.

Lastly, I wanted to look at their average academic performance, which is depicted in the G3 final grade (numeric: from 0 to 20, output target):

g1 <- pormath %>%
  ggplot(aes(x = high_use, y = G3, fill = sex)) +
  geom_violin(width = 1, alpha = 0.5) +
  geom_boxplot(position = position_dodge(width =1),
               width = 0.1,
               alpha = 0.2) +
  facet_wrap("sex") +
  theme_classic()

g2 <- pormath %>%
  ggplot(aes(x = alc_use, y = G3, group = alc_use, fill = sex)) +
  geom_violin( width = 0.5, alpha = 0.5) +
  geom_boxplot( width = 0.1, alpha = 0.2) +
  facet_wrap("sex") +
  theme_classic()

ggarrange(g1, g2, ncol = 1)
## Warning: Groups with fewer than two data points have been dropped.

## Warning: Groups with fewer than two data points have been dropped.

While there is no significant change in the means, note how the variation in the final grade shrinks as alc_use and high_use increase. In other words, the performance of students who consume less alcohol has a wider distribution.

But let us focus on building our model. We start by including the four variables of interest in the logistic regression model and looking at the coefficient table presented by the summary() function.

model1 <- glm(high_use ~ age + health + famrel + famsup, 
              data = pormath_subset, 
              family = "binomial")

summary(model1)
## 
## Call:
## glm(formula = high_use ~ age + health + famrel + famsup, family = "binomial", 
##     data = pormath_subset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4952  -0.8754  -0.7331   1.3372   2.0207  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -3.65006    1.75593  -2.079   0.0376 * 
## age          0.21535    0.09932   2.168   0.0301 * 
## health       0.16502    0.08600   1.919   0.0550 . 
## famrel      -0.32958    0.12617  -2.612   0.0090 **
## famsupyes   -0.14944    0.23926  -0.625   0.5322   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 437.98  on 365  degrees of freedom
## AIC: 447.98
## 
## Number of Fisher Scoring iterations: 4

famsup is a categorical variable, so the tests are performed a bit differently. Here, a Wald test checks whether the difference between the coefficient of the reference level (no) and the other level (yes) is different from zero. From it, we know that there is no significant difference between these levels. To explore whether the variable as a whole is significant, we first remove the intercept.

model2 <- glm(high_use ~ age + health + famrel + famsup - 1, 
              data = pormath_subset, 
              family = "binomial")

summary(model2)
## 
## Call:
## glm(formula = high_use ~ age + health + famrel + famsup - 1, 
##     family = "binomial", data = pormath_subset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4952  -0.8754  -0.7331   1.3372   2.0207  
## 
## Coefficients:
##           Estimate Std. Error z value Pr(>|z|)   
## age        0.21535    0.09932   2.168   0.0301 * 
## health     0.16502    0.08600   1.919   0.0550 . 
## famrel    -0.32958    0.12617  -2.612   0.0090 **
## famsupno  -3.65006    1.75593  -2.079   0.0376 * 
## famsupyes -3.79950    1.72571  -2.202   0.0277 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 512.93  on 370  degrees of freedom
## Residual deviance: 437.98  on 365  degrees of freedom
## AIC: 447.98
## 
## Number of Fisher Scoring iterations: 4

By removing the intercept, we are now testing whether each coefficient of the categorical variable is different from zero. We see that the individual coefficients are now considered significant.

To resolve this contradiction, we can fit a third model that omits famsup and compare it to our first model using a likelihood ratio test. This tells us whether dropping all the famsup coefficients significantly worsens the fit.

model3 <- glm(high_use ~ age + health + famrel, 
              data = pormath_subset, 
              family = "binomial")

summary(model3)
## 
## Call:
## glm(formula = high_use ~ age + health + famrel, family = "binomial", 
##     data = pormath_subset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4559  -0.8625  -0.7425   1.3298   2.0631  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -3.86577    1.72440  -2.242  0.02497 * 
## age          0.22295    0.09865   2.260  0.02382 * 
## health       0.16310    0.08593   1.898  0.05770 . 
## famrel      -0.32859    0.12608  -2.606  0.00916 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 438.37  on 366  degrees of freedom
## AIC: 446.37
## 
## Number of Fisher Scoring iterations: 4
anova(model1, model3, test = "LRT")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ age + health + famrel + famsup
## Model 2: high_use ~ age + health + famrel
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       365     437.98                     
## 2       366     438.37 -1  -0.3887    0.533

The high Pr(>Chi) tells us that the variable should not remain in the model.
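The same p-value can be recovered by hand from the two residual deviances reported above, since the deviance difference follows a chi-squared distribution with degrees of freedom equal to the number of dropped coefficients (here 1):

```r
# LRT by hand; the deviances are copied from the summaries above.
dev_with    <- 437.98  # residual deviance, model with famsup
dev_without <- 438.37  # residual deviance, model without famsup
1 - pchisq(dev_without - dev_with, df = 1)
## ~0.53, in line with the anova() table
```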

We can also see that health is not significant, although it is quite close. Therefore, we will try removing it as well.

model4 <- glm(high_use ~ age + famrel, 
              data = pormath_subset, 
              family = "binomial")

summary(model4)
## 
## Call:
## glm(formula = high_use ~ age + famrel, family = "binomial", data = pormath_subset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3086  -0.8605  -0.7384   1.3540   1.8530  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -3.14909    1.66508  -1.891   0.0586 .
## age          0.20841    0.09761   2.135   0.0328 *
## famrel      -0.29917    0.12361  -2.420   0.0155 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 442.08  on 367  degrees of freedom
## AIC: 448.08
## 
## Number of Fisher Scoring iterations: 4

This caused the intercept to lose significance, while the remaining variables stay significant. Regardless, after running the predictive and cross-validation analyses, model 3 outperformed model 4. Therefore, I consider it the best available model with these variables.
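For transparency, here is a sketch of that kind of model comparison by 10-fold cross-validation. It runs on simulated data whose variable names mirror the course data but whose values are made up, so the errors will not match the ones reported here:

```r
# Sketch: comparing two nested logistic models by 10-fold CV error.
# The data below are simulated -- NOT the course data.
library(boot)

set.seed(1)
n <- 370
d <- data.frame(
  age    = sample(15:22, n, replace = TRUE),
  health = sample(1:5, n, replace = TRUE),
  famrel = sample(1:5, n, replace = TRUE)
)
d$high_use <- rbinom(n, 1, plogis(-3 + 0.2 * d$age - 0.3 * d$famrel))

loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

m3 <- glm(high_use ~ age + health + famrel, data = d, family = "binomial")
m4 <- glm(high_use ~ age + famrel,          data = d, family = "binomial")

cv3 <- cv.glm(d, m3, cost = loss_func, K = 10)$delta[1]
cv4 <- cv.glm(d, m4, cost = loss_func, K = 10)$delta[1]
c(model3 = cv3, model4 = cv4)  # the smaller CV error wins
```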

So regarding my hypothesis:

  • FALSE Even though health did not show a significant relationship with high_use, its inclusion in the model improved its performance.
  • PARTLY TRUE Only famrel, and not famsup, had an effect on alcohol consumption.
  • TRUE Age had a significant effect on high_use.

So let’s look at the odds ratios with their CIs.

model <- model3

OR <- model %>% 
  coef %>% 
  exp

CI <- model %>% 
  confint %>% 
  exp
## Waiting for profiling to be done...
m_OR <- cbind(OR, CI)
m_OR
##                     OR        2.5 %    97.5 %
## (Intercept) 0.02094671 0.0006882756 0.6060803
## age         1.24976103 1.0307101817 1.5191667
## health      1.17715618 0.9972312442 1.3977846
## famrel      0.71994050 0.5607948332 0.9210363

What we can conclude from these coefficients is that:

  • Higher age is associated with higher odds (OR = 1.25) of high_use.
  • (Not quite significant) Better health is associated with higher odds (OR = 1.18) of high_use.
  • Higher famrel is associated with lower odds (OR = 0.72) of high_use.

A 2X2 cross tabulation can give us a better idea of the predictive power of the model:

probs <- predict(model, type = "response") # Predict probability responses based on the variables

pormath_predict <- pormath_subset %>%
  mutate(probability = probs) %>% # add them to our data.frame
  mutate(prediction = probability > 0.5) # Compare those probabilities to our significance cutoff to identify the logical value

# Plot the 2X2 cross tabulation of the predictions
table(high_use = pormath_predict$high_use, 
      prediction = pormath_predict$prediction) %>%
  prop.table %>%
  addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.68648649 0.01351351 0.70000000
##    TRUE  0.28648649 0.01351351 0.30000000
##    Sum   0.97297297 0.02702703 1.00000000

Around 70% of the values were predicted correctly. We can also see a graphical depiction:

pormath_predict %>%
  ggplot(aes(x = probability, y = high_use, col = prediction)) + 
  geom_point() +
  theme_classic()

We also can compute the total proportion of inaccurately classified individuals or training error by using the following function:

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
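A quick sanity check on toy values (the function is redefined here so the snippet runs on its own): two of the four predictions fall on the wrong side of 0.5.

```r
# abs(class - prob) > 0.5 flags a prediction on the wrong side of 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = c(1, 0, 1, 0), prob = c(0.9, 0.2, 0.1, 0.8))
## [1] 0.5
```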

We can apply it to the data and compare its performance against a 10-fold cross-validation.

te <- loss_func(class = pormath_predict$high_use, prob = pormath_predict$probability)

set.seed(1)
cv <- pormath_predict %>%
  cv.glm(cost = loss_func, glmfit = model, K = 10)

cat(paste("Training error: ", te, "\n10-fold CV: ", cv$delta[1]))
## Training error:  0.3 
## 10-fold CV:  0.308108108108108

Using the variables I selected, I could not make the model perform better than the one from the DataCamp exercise.


Chapter 4: Clustering and classification

date()
## [1] "Thu Dec 09 16:18:03 2021"

Here we go again…

Remember! Set your working directory and load your libraries.

knitr::opts_knit$set(root.dir ="C:/Users/ramosgon/OneDrive - University of Helsinki/Courses/OpenDataScience/IODS-project/Data")

library(tidyverse)
library(ggplot2)
library(MASS)
library(corrplot)
library(GGally)
library(car)

We read the data.

data("Boston")

“Housing Values in Suburbs of Boston” is a data set native to the MASS package in R. The data set includes various indicators for the different Boston suburbs. Further explanation on the variables can be found here.

We first explore the dimensions and structure.

dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The data set contains mostly numeric variables, with the exception of rad (index of accessibility to radial highways) and chas (Charles River dummy variable: 1 if the tract bounds the river, 0 otherwise). They are a ranked factor and a logical variable, respectively. Therefore, we convert them and look at the results.

Note: Because rad has a large number of factor levels, we will keep it as an integer.

Boston <- Boston %>%
  mutate(chas = as.logical(chas))

summary(Boston)
##       crim                zn             indus          chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Mode :logical  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   FALSE:471      
##  Median : 0.25651   Median :  0.00   Median : 9.69   TRUE :35       
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14                  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10                  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74                  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The table above provides a lot of information on the quartile distribution of our variables, but I would always rather look at the variables with boxplots or barplots:

Boston %>%
  pivot_longer(cols = c(1:14), names_to = "Attribute", values_to = "Value") %>%
  filter(!Attribute %in% c("chas","rad")) %>%
  ggplot(aes(x = Attribute, y = Value)) +
  geom_boxplot() +
  facet_wrap("Attribute", scales = "free") +
  theme_classic()

Boston %>%
  pivot_longer(cols = c(1:14), names_to = "Attribute", values_to = "Value") %>%
  filter(Attribute %in% c("chas","rad")) %>%
  ggplot(aes(x = Value)) +
  geom_histogram() +
  facet_wrap("Attribute", scales = "free") +
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can conclude:

  • 93% of the towns do not border the Charles River.
  • Below an accessibility index of 10, rad looks roughly normally distributed; the distribution is disrupted by the towns with an index of 24.
  • Three quarters of the towns present at most 3.67 crimes per capita.
  • Only the 1st quartile falls below 375 in the black variable (see the calculation here).

Additionally, we can look at the normality of the non-factorial distributions using the qqPlot.

colnames(Boston)[-c(4,9)] %>% sapply(function(x){
  qqPlot(Boston[,x], main = x)
})

##      crim  zn indus nox  rm age dis tax ptratio black lstat medv
## [1,]  381  58   489 143 366  42 354 489     197   451   375  162
## [2,]  419 200   490 144 365  75 352 490     198   424   415  163

The paired statistics and scatter plots of the variables are also useful to start looking at the data.

Boston %>%
  ggpairs(mapping = aes(alpha = 0.3), 
          lower = list(combo = wrap("facethist", bins = 20)),
          upper = list(continuous = wrap("cor", size = 1.5)))

From here we can see that:

  • Crime presents correlation above 0.3 with all variables except for zn (proportion of residential land zoned for lots over 25,000 sq.ft), rm (average number of rooms per dwelling), and ptratio (pupil-teacher ratio by town).
  • Crime shows a positive relation with indus (proportion of non-retail business acres per town), nox (nitrogen oxides concentration (parts per 10 million)), age (proportion of owner-occupied units built prior to 1940), rad, tax (full-value property-tax rate per $10,000), and lstat (lower status of the population (percent)).
  • Crime shows a negative relation with dis (weighted mean of distances to five Boston employment centres), black, and medv (median value of owner-occupied homes in $1000s).

Another way of looking at these values is a correlation plot (corrplot or ggcorr):

data("Boston")

cor_mat <- cor(Boston) %>% round(digits = 2)

corrplot(cor_mat, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

ggcorr(Boston, geom = "circle", method = c("everything", "pearson"))

These show the same information as the ggpairs plot.

Due to the different scales of the variables, it is hard to compute the influence of each one and compare them reliably. Therefore, we present a summary of the scaled variables.

boston_scaled <- Boston %>%
  scale %>%
  as.data.frame

summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

Now we can compare the variables directly: crim presents the biggest range of scaled data, and we can start analyzing the scaled effect of each variable on the crime rate.
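As a reminder of what scale() does column-wise — subtract the mean and divide by the standard deviation:

```r
# scale() centers each column and divides it by its standard deviation:
x <- c(2, 4, 6)      # mean 4, sd 2
as.vector(scale(x))
## [1] -1  0  1
```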

First we separate the crime variable in quartiles and assign the low, med-low, med-high, and high labels to each quartile.

bins <- quantile(boston_scaled$crim) # binning in quartiles

crime <- boston_scaled$crim %>%
  cut(breaks = bins, 
      include.lowest = TRUE, 
      labels = c("low", "med_low", "med_high", "high"))
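A toy illustration of the same quantile binning, on hypothetical values rather than the Boston data:

```r
# quantile() gives the quartile breaks; cut() assigns each value
# to one of the four labelled bins.
x <- 1:12
bins <- quantile(x)
binned <- cut(x, breaks = bins, include.lowest = TRUE,
              labels = c("low", "med_low", "med_high", "high"))
table(binned)  # three observations land in each bin
```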

We now substitute the numeric crime variable for the binned quartiles.

boston_qt <- boston_scaled %>% 
  mutate(crim = crime)

Once we have the factor variable, it is time to generate our training and testing data sets. Our training data set will be used to generate an explanatory model for our target variable crime. The model will be applied to the test data to predict the outcome. Note that the sampling is random, so each iteration of this testing might give slightly different models.

set.seed(123) # to make the random sampling reproducible

n <- nrow(boston_qt)

ind <- sample(n,  size = n * 0.8)

train <- boston_qt[ind,]

test <- boston_qt[-ind,]

Save and remove the crime levels from the testing data.

correct_classes <- test$crim

test <- dplyr::select(test, -crim)

And now we are ready to do our linear discriminant analysis (LDA). LDA is a classification method used for the modelling of categorical variables. Let’s remember that this model assumes normality of the variables and that is why we do the scaling first. Additionally, variables should be continuous, but for this exercise we can ignore chas.

lda.fit <- lda(crim ~ ., data = train)

lda.fit
## Call:
## lda(crim ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2500000 0.2500000 0.2450495 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       1.01866313 -0.9066422 -0.08120770 -0.8835724  0.38666334 -0.9213895
## med_low  -0.07141687 -0.3429155  0.03952046 -0.5768545 -0.09882533 -0.3254849
## med_high -0.40148591  0.2162741  0.19544522  0.3637157  0.12480815  0.4564260
## high     -0.48724019  1.0149946 -0.03371693  1.0481437 -0.47733231  0.8274496
##                 dis        rad        tax    ptratio      black        lstat
## low       0.9094324 -0.6819317 -0.7458486 -0.4234280  0.3729083 -0.766883775
## med_low   0.3694282 -0.5395408 -0.4205644 -0.1079710  0.3103406 -0.164921798
## med_high -0.3720478 -0.4349280 -0.3191090 -0.2716959  0.1049654  0.009720124
## high     -0.8601246  1.6596029  1.5294129  0.8057784 -0.6383964  0.900379309
##                 medv
## low       0.47009410
## med_low   0.01548761
## med_high  0.19440747
## high     -0.71294711
## 
## Coefficients of linear discriminants:
##                  LD1         LD2        LD3
## zn       0.148390645  0.74870532 -1.0874785
## indus    0.040971465 -0.38126652 -0.1619456
## chas     0.002460776  0.03963849  0.1699312
## nox      0.312458150 -0.67564471 -1.3104018
## rm       0.011086947 -0.16058718 -0.1572603
## age      0.283482486 -0.38817624 -0.1013491
## dis     -0.079848789 -0.38493775  0.2108038
## rad      3.718978412  0.93123532 -0.4706522
## tax     -0.015197127  0.04230505  1.2889075
## ptratio  0.180294774 -0.01592588 -0.3558490
## black   -0.136724112  0.02948075  0.1288959
## lstat    0.145739238 -0.37823065  0.3345688
## medv     0.061327205 -0.53906503 -0.1509890
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9523 0.0364 0.0113

The prior probabilities tell us the probability of falling into each of the groups. The coefficients of linear discriminants give us, per variable, the coefficient of each linear discriminant that maximizes the ratio of between-group variance to within-group variance. Lastly, the proportion of trace is the share of the between-group variance explained by each linear discriminant (e.g. LD1 explains 95% of it).
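The proportion of trace can also be recovered from the singular values stored in the fitted object. A sketch using the built-in iris data as a stand-in for the training set:

```r
# Proportion of trace = squared singular values, normalized to sum to 1.
library(MASS)
fit <- lda(Species ~ ., data = iris)   # iris as a stand-in data set
prop <- fit$svd^2 / sum(fit$svd^2)     # between-group variance per LD
round(prop, 4)
## [1] 0.9912 0.0088
```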

The following code takes these coefficients and draws arrows on the plot below, representing the size of each scaled variable's effect on the model.

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

We change the crime variable in our training data set to a numeric vector.

classes <- as.numeric(train$crim)

We plot the LDA fitted model (LD1 vs LD2) and separate each class using colors and figures. We also run the function to draw the arrows.

plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

We can observe that rad has the biggest influence on the separation of the model, and its influence is mostly captured by LD1. This means that changes in rad are the most likely to push an observation into the high crime class.

Now we can test the predictive power of this model over the test data set and create a table to compare the results with those stored in correct_classes.

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       10      10        4    0
##   med_low    2      17        6    0
##   med_high   1       9       13    2
##   high       0       0        1   27

We can now apply the same loss function we used before to assess the proportion of wrong predictions (with integer class codes, any mismatch differs by at least 1, so the > 0.5 rule simply counts misclassifications).

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(as.numeric(correct_classes),as.numeric(lda.pred$class))
## [1] 0.3431373

We can see that 34% of the predictions were incorrect.

For the next part of the analysis, we will look at different ways of measuring the similarity of the towns, using both Euclidean and Manhattan distances. The Euclidean distance is simply the geometric (straight-line) distance between two points, while the Manhattan distance sums the absolute differences between their coordinates.
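A worked two-point example makes the difference concrete:

```r
# Two points: (3, 4) and the origin.
p <- rbind(c(3, 4), c(0, 0))
dist(p)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(p, method = "manhattan")  # Manhattan: |3 - 0| + |4 - 0| = 7
```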

First we measure the Euclidean distance:

data("Boston")

boston_scaled <- Boston %>%
  scale %>%
  as.data.frame

dist_eu <- dist(boston_scaled)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

and the Manhattan distance as well:

dist_man <- dist(boston_scaled, method = 'manhattan')
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

These two distances are applied in clustering protocols by minimizing the distance between the observations.

One example of such clustering protocols is the k-means clustering. Note that, contrary to the LDA classification method, the clustering is an unsupervised method that does not have the final classification determined a priori. Also remember that the distance measuring method will change the output of your clustering.

So now we run the k-means clustering method on the scaled data, starting with three centroids.

km <- kmeans(boston_scaled, centers = 3)

pairs(boston_scaled, col = km$cluster)

In the plot we can observe how the distribution into the clusters separates the data in the pairwise scatter plots. But this was just a guessed number of centroids. We can optimize this by using the within cluster sum of squares (WCSS).

We will run the k-means clustering method while changing the number of clusters from 1 to 10. We will take the WCSS (i.e. the variance between the observations in a cluster and the center of that cluster) and compare it as the number of centers increases.

So first we run all the different k in 1:10 and store the total WCSS (TWCSS).

set.seed(123) # to make the random initialization reproducible
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

Now we plot the TWCSS’s vs the number of centers.

qplot(x = 1:k_max, y = twcss, geom = 'line')

We can see that the biggest change occurs between 1 and 2 centers. We will try now the clustering with two centers.

km <- kmeans(Boston, centers = 2)
pairs(Boston, col = km$cluster)

Looking at the scatter plots, we can appreciate that the groups are better defined than when using 3 centers. Therefore, 2 centers is our optimal k for the clustering analysis.

For the bonus points: LDA on k-means

data("Boston")

boston_scaled <- Boston %>%
  scale %>%
  as.data.frame

set.seed(123) # to make the random initialization reproducible

km <- kmeans(boston_scaled, centers = 3)

boston_km <- boston_scaled %>% 
  mutate(crim = as.factor(km$cluster))

n <- nrow(boston_km)

ind <- sample(n,  size = n * 0.8)

train <- boston_km[ind,]

test <- boston_km[-ind,]

correct_classes <- test$crim

test <- dplyr::select(test, -crim)

lda.fit <- lda(crim ~ ., data = train)

classes <- as.numeric(train$crim)

plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

We can observe that rad remains the biggest influence on the separation of the model, and again its influence is mostly captured by LD1. This means that changes in rad are the most likely to push an observation into cluster 1. age appears to be the next most influential variable, mostly separating clusters 2 and 3 along the LD2 axis.


Chapter 5: Dimensionality reduction techniques

date()
## [1] "Thu Dec 09 16:18:37 2021"

Here we go again…

Remember! Set your working directory and load your libraries.

knitr::opts_knit$set(root.dir ="C:/Users/ramosgon/OneDrive - University of Helsinki/Courses/OpenDataScience/IODS-project/Data")

library(tidyverse)
library(car)
library(GGally)
library(FactoMineR)

We read the data.

human <- read.csv("human_base.csv", header = T, sep = ",")

The Human Development Index (HDI) is a summary measure of a country's average achievement in human development, such as a long and healthy life with a decent standard of living. The HDI is the geometric mean of normalized indices representing human development in three categories. The Gender Inequality Index (GII) reflects gender-based disadvantage in reproductive health, empowerment and the labor market. The GII is shown together with the HDI to expose differences in human development caused by inequality between female and male achievements. The following data set consolidates the HDI and GII indicators (17 in total) for 195 countries. For more information on the topic, visit the following links:

http://hdr.undp.org/sites/default/files/hdr2015_technical_notes.pdf

http://hdr.undp.org/en/content/human-development-index-hdi

First we re-format GNI to contain numbers instead of character strings.

human$GNI <- human$GNI %>%
  gsub(",","", x = .) %>%
  as.numeric
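A toy check of the comma-stripping conversion:

```r
# gsub() removes the thousands separators, then as.numeric() parses:
x <- c("1,234", "56,789")
as.numeric(gsub(",", "", x))
## [1]  1234 56789
```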

We select the variables of interest for our study and remove the countries that are missing any of the indices (i.e. have NA’s).

keep <- c("Country", "Edu2.FM", "Labo.FM", "Edu.Exp", "Life.Exp",
          "GNI", "Mat.Mor", "Ado.Birth", "Parli.F")

human <- human %>%
  dplyr::select(one_of(keep))

human_ <- human %>% 
  filter(complete.cases(human))

Looking at the tail of the table, we notice that the last seven observations are regions instead of countries. Therefore, we remove them as we turn the country names into row names.

tail(human_, 10)
##                             Country   Edu2.FM   Labo.FM Edu.Exp Life.Exp   GNI
## 153                            Chad 0.1717172 0.8080808     7.4     51.6  2085
## 154        Central African Republic 0.3782772 0.8531140     7.2     50.7   581
## 155                           Niger 0.3076923 0.4459309     5.4     61.4   908
## 156                     Arab States 0.7289916 0.3081009    12.0     70.6 15722
## 157       East Asia and the Pacific 0.8250377 0.7884131    12.7     74.0 11449
## 158         Europe and Central Asia 0.8784119 0.6514286    13.6     72.3 12791
## 159 Latin America and the Caribbean 0.9836957 0.6729323    14.0     75.0 14242
## 160                      South Asia 0.5329670 0.3711083    11.2     68.4  5605
## 161              Sub-Saharan Africa 0.7015873 0.8537859     9.6     58.5  3363
## 162                           World 0.8333333 0.6558018    12.2     71.5 14301
##     Mat.Mor Ado.Birth Parli.F
## 153     980     152.0    14.9
## 154     880      98.3    12.5
## 155     630     204.8    13.3
## 156     155      45.4    14.0
## 157      72      21.2    18.7
## 158      28      30.8    19.0
## 159      85      68.3    27.0
## 160     183      38.7    17.5
## 161     506     109.7    22.5
## 162     210      47.4    21.8
last <- nrow(human_) - 7

human_ <- human_[1:last, ]

rownames(human_) <- human_$Country

human <- human_[,-1]

Now we can take a look at the data. With 155 countries remaining, we will be studying the interactions of 8 continuous variables of int or num type.

dim(human)
## [1] 155   8
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : num  64992 42261 56431 44025 45435 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

As in the last chapter, it is hard to compare the ranges of the variables, or their means and standard deviations, without scaling first, so let’s look at them graphically.

human %>% rownames_to_column(var = "Country") %>%
  pivot_longer(cols = c(-1), names_to = "Attribute", values_to = "Value") %>%
  ggplot(aes(x = Attribute, y = Value)) +
  geom_boxplot() +
  facet_wrap("Attribute", scales = "free", nrow = 2) +
  theme_classic()

Most of the variables seem to show normal distributions according to the box plots: they mostly lack outliers and the median is close to the center of the range. Maternal mortality and GNI do not show these characteristics. First of all, GNI has a range at least 3 orders of magnitude bigger than any other variable. Also, both of these variables concentrate three quartiles in the lower part of their range, with outliers that go much higher.

Let us use qqPlot to confirm the normality of these distributions.

colnames(human) %>% sapply(function(x){
  qqPlot(human[,x], main = x)
})

##      Edu2.FM Labo.FM Edu.Exp Life.Exp GNI Mat.Mor Ado.Birth Parli.F
## [1,]     153     116     155      126  30     150       155     136
## [2,]     141     141       2      133  44     153       148     104

As expected from the box plots, GNI and maternal mortality diverge from the identity line in our qqplots making their normality questionable.

To continue our analysis, we will look at the correlation coefficients. For correlation comparison, scaling is not considered necessary, so let us see the situation when the values are not scaled.

human %>%
  ggpairs(mapping = aes(alpha = 0.3))

human %>%
  ggcorr(method = c("everything", "pearson"))

From here we can see:

  • Maternal mortality and life expectancy show the highest correlation which happens to be negative. This makes sense, since the lack of health development would clearly be reflected by low healthcare availability or quality, and, thus, higher mortality rates and lower life expectancy.
  • Expected education and life expectancy show the next highest correlation, but positive this time.

We could keep discussing the correlations between variables, but with this many variables the model is complex and we might reach unclear conclusions. Therefore, we will use a dimensionality reduction technique. Principal component analysis (PCA) lets us focus on the main dimensions that explain most of the variation in the data points. PCA starts by defining the dimension along which the variance of the points is maximized; this is our first principal component, since it accounts for the most variation. From then on, the algorithm keeps looking for dimensions that maximize the remaining variance, with the catch that, from the second principal component onward, each dimension must be orthogonal to the previous ones. For further details on PCA watch this video, which I love.
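The mechanics above can be reproduced by hand: the principal axes are the eigenvectors of the covariance matrix, ordered by decreasing eigenvalue. A small sketch on hypothetical toy data (`X` is not part of our analysis), comparing the manual result against prcomp():

```r
# PCA by hand on toy data: eigendecomposition of the covariance matrix.
set.seed(1)
X <- scale(matrix(rnorm(100), ncol = 4), center = TRUE, scale = FALSE)
e <- eigen(cov(X))
pc_scores <- X %*% e$vectors   # projections onto the principal axes

# prcomp() finds the same axes (possibly with flipped signs):
p <- prcomp(X, center = FALSE)
all.equal(abs(unname(p$rotation)), abs(e$vectors))
```

The sign flips are harmless: a principal axis and its mirror image describe the same direction of variation.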

So let's go at it. Note that I have not scaled the values yet.

pca_human <- prcomp(human)

s <- summary(pca_human)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000

From the summary table you can see that PC1 accounts for 99.99% of the variance. At face value, this one dimension explains essentially all of the variance, so hurray!
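For reference, the "Proportion of Variance" row of the summary can be reproduced directly from the component standard deviations stored in the prcomp object (a sketch, assuming `pca_human` from the chunk above):

```r
# Proportion of variance explained: squared sdev over total variance.
pve <- pca_human$sdev^2 / sum(pca_human$sdev^2)
round(pve, 4)          # per-component proportion
round(cumsum(pve), 4)  # cumulative proportion
```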

Wait, there is a catch. Let us look at a biplot of the first 2 PCs together with each variable's influence (represented by pink arrows showing the projection of each variable's axis onto the 2-D plane spanned by PC1 and PC2).

pca_pr <- round(100*s$importance[2,], digits = 2)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)") # variance percentages as axis labels

biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"),
       xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Just as expected, the lack of scaling is causing problems. As discussed, GNI is over 3 orders of magnitude bigger than any other variable, so the variation in that variable dominates the PCA. That is why the projection of the GNI dimension is so prominent: PC1 is almost parallel to it.
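What scale() does is simple: it centers each column and divides by its standard deviation. A manual equivalent, as a sketch (`X` here stands for any numeric matrix such as `human`):

```r
# Manual z-score standardization, equivalent to scale(X).
X <- as.matrix(human)
X_std <- sweep(sweep(X, 2, colMeans(X), "-"), 2, apply(X, 2, sd), "/")
all.equal(X_std, scale(X), check.attributes = FALSE)
```

After this transformation every variable has mean 0 and standard deviation 1, so no single variable can dominate the variance.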

Since PCA works by maximizing variance, we have to scale the variables for the analysis to be worthwhile. So here we go with the scaled data.

human_std <- scale(human)

pca_human <- prcomp(human_std)

s <- summary(pca_human)
s
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000

Much better: now we can see how the variation is spread across the different PCs. Let us look at PC1 and PC2, since together they account for almost 70% of the variation.

pca_pr <- round(100*s$importance[2,], digits = 2)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)") # variance percentages as axis labels

biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"),
       xlab = pc_lab[1], ylab = pc_lab[2])

Now we can see that:

  • Maternal mortality and adolescent birth rate show a close relationship, as their projections onto the first 2 PCs are very similar. Additionally, they are almost parallel to PC1 and almost opposed to life expectancy. Furthermore, we can observe that the countries in the rightmost section of the plot belong mostly to the African continent.
  • In the top left corner we can see the Nordics. According to the projections of the variables, these countries would have the highest GNI, expected education, life expectancy, female-to-male secondary education ratio, and share of women in parliament.

Since this is not my area of expertise I would rather not make overly bold observations, but there does seem to be a clear contrast between developed and developing countries in their human development indicators.

For data sets with categorical variables, multiple correspondence analysis (MCA) is a good alternative to PCA. Here we load the tea data set from the FactoMineR package.

library(FactoMineR) # provides the tea data set and the MCA() function

data(tea)

dim(tea)
## [1] 300  36
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 

The data used here comes from a questionnaire about tea consumption: 300 individuals were interviewed about how they drink tea (18 questions), their perception of the product (12 questions) and some personal details (4 questions).

Since these are categorical variables, let's just look at bar plots of their distributions.

tea %>%
  dplyr::select(-age) %>%
  pivot_longer(cols = c(-sex), names_to = "Attribute", values_to = "Value") %>%
  ggplot(aes(Value, fill = sex)) +
  facet_wrap("Attribute", scales = "free") +
  geom_bar(position = "dodge",alpha = 0.4) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

We can see that:

  • For most variables, gender does not seem to affect the tea preferences of the interviewees.

Again, all of these variables can seem overwhelming, but MCA can simplify things. Before that, let us select the variables describing how the respondents drink tea.

keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

tea <- dplyr::select(tea, one_of(keep_columns))

MCA uses frequencies instead of geometrical distances to find the patterns between categorical variables. It relies on multi-dimensional cross tabulations of the data. There are various ways to build these, such as the indicator matrix or the Burt matrix; their main idea is to split each categorical variable into its categories and align those with the observations.
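To make the indicator matrix idea concrete, here is a sketch that builds full 0/1 dummy columns for two of the tea variables with model.matrix() (the object name `Z` and the choice of two variables are just for illustration):

```r
# Indicator (disjunctive) matrix: one 0/1 column per factor level.
Z <- model.matrix(~ Tea + sugar - 1,
                  data = tea,
                  contrasts.arg = lapply(tea[, c("Tea", "sugar")],
                                         contrasts, contrasts = FALSE))
head(Z)

# The Burt matrix (here for just these two variables) is the
# cross-tabulation of all the indicator columns:
B <- t(Z) %*% Z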

Using the inter- and intra-category variances, the algorithm then generates new dimensions that explain as much variance as possible. Let's try it.

mca <- tea %>% 
  MCA(graph = FALSE)

summary(mca)
## 
## Call:
## MCA(X = ., graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

We can conclude that:

  • Dim 1 and Dim 2 together account for only about 29% of the variance.
  • Even for the first 10 individuals, the contribution to the variance is not very high.
  • In the categories, we can look at the v.test to confirm that coordinates with an absolute value above 1.96 are significantly different from 0.
  • In the categories, the squared correlation (cos2) between the category and the dimension tells us whether there is a strong relationship between the variable and the variance explained.

It is much easier to interpret by looking at the graphical depiction of this analysis.

plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")

Note that this is the categories plot. It tells us which categories are closely related to each other. We can therefore conclude:

  • Those buying tea bags mostly get them from chain stores, while unpackaged tea drinkers usually get theirs from tea shops. Groups without a preference for tea bags or unpackaged tea also show no preference for the buying location.
  • Those who do not drink tea at lunch (Not.lunch) seem to have stronger preferences about the place and type of tea to buy than lunch tea drinkers.
  • Earl Grey drinkers are more likely to take their tea with milk and sugar than black tea drinkers, who sit closer to lemon and plain (alone) tea.

With this exercise we could explore how the categories are distributed. Remember that these two dimensions only account for about 29% of the variance. Therefore, always try to explore at least some of the next dimensions (this also applies to PCA).
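Exploring the next dimensions is easy, since plot.MCA() accepts an `axes` argument selecting which dimensions to draw (a sketch, reusing the `mca` object fitted above):

```r
# Categories plot for dimensions 3 and 4 instead of the default 1 and 2.
plot(mca, axes = c(3, 4), invisible = c("ind"),
     habillage = "quali", graph.type = "classic")
```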


Chapter 6: Analysis of longitudinal data

date()
## [1] "Thu Dec 09 16:18:54 2021"

Here we go again…

Remember! Set your working directory and load your libraries.

knitr::opts_knit$set(root.dir ="C:/Users/ramosgon/OneDrive - University of Helsinki/Courses/OpenDataScience/IODS-project/Data")

library(ggpubr)
library(tidyverse)
library(ggplot2)
library(lme4)
library(broom)

Rats Data

The following data is taken from Crowder and Hand (1990). The effect of three diets, one per group, on the weight of rats was studied over 9 weeks. A measurement was made at the starting time point and then weekly, except for week 7, when an extra measurement was taken.

First, we read the data.

rats <- read.csv("rats_long.csv") %>%
   mutate(across(c(ID, Group), factor))

And observe its structure and summarized values.

str(rats)
## 'data.frame':    176 obs. of  4 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Time  : int  1 8 15 22 29 36 43 44 50 57 ...
##  $ Weight: int  240 250 255 260 262 258 266 266 265 272 ...
summary(rats)
##        ID      Group       Time           Weight     
##  1      : 11   1:88   Min.   : 1.00   Min.   :225.0  
##  2      : 11   2:44   1st Qu.:15.00   1st Qu.:267.0  
##  3      : 11   3:44   Median :36.00   Median :344.5  
##  4      : 11          Mean   :33.55   Mean   :384.5  
##  5      : 11          3rd Qu.:50.00   3rd Qu.:511.2  
##  6      : 11          Max.   :64.00   Max.   :628.0  
##  (Other):110

We can see we have the long version of the data that was wrangled previously. An ID factor with 16 levels tells us there were a total of 16 subjects, weighed 11 times each. We can also see that the diet for group 1 was given to twice as many rats as either of the other two diets. Time seems to be expressed in days instead of weeks. Lastly, the weight of the rats spans a range of over 400 grams. Since these measurements were repeated on the same subjects over an extended period of time, we can label this as longitudinal data.

A great way to start looking at longitudinal data is to plot the response directly, in this case weight vs. time. We will also color the plotted lines by each of the three groups.

rats %>%
  ggplot(aes(x = Time, y = Weight, group = ID)) +
  geom_line(aes(col = Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Weight (grams)") +
  ggtitle("Rat's weight over time") +
  theme_bw()

We can observe that rats with a higher weight at the beginning of the study remain that way throughout the study, and the same can be said about those with a lower weight. Let us try to observe this tracking phenomenon in more detail by standardizing the weights at each measurement time.

rats <- rats %>%
  group_by(Time) %>%
  mutate(stdweight = (Weight - mean(Weight))/sd(Weight)) %>%
  ungroup()

rats %>%
  ggplot(aes(x = Time, y = stdweight, group = ID)) +
  geom_line(aes(col = Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Standardized weight") +
  ggtitle("Rats' standardized weight over time") +
  theme_bw()

Though it still seems a bit cryptic, the plot shows a flatter pattern in the weight measurements, confirming that the tracking trend is there. This is still a complex way to look at the data, though. When working with a large number of subjects, it is better to look at the average profile of each group of interest. Let us take the average by time point and group and plot them with their standard errors. (Note that the se calculation accounts for the unequal number of subjects per group by simply calling the n() function.)

ratsTS <- rats %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se = sd(Weight)/sqrt(n()) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
glimpse(ratsTS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, ~
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2~
## $ se    <dbl> 5.381640, 4.629100, 4.057346, 4.808614, 3.909409, 4.166190, 3.87~
ratsTS %>%
  ggplot(aes(x = Time, y = mean, col = Group, shape = Group)) +
  geom_line() +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width=0.5) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)") +
  theme_bw()

From here we can conclude:

  • The variability in the subjects of group 2 is the highest.
  • Group 1 seems to have the lowest weight, but their starting weight is also the lowest, so further analysis is needed.

If the measurements are equally spaced in time, we can easily look at a box plot per time point. Let us remember that this is not quite the case for this data (week 7), but let us take a look anyway.

rats %>%
  mutate(across(c(Time), factor)) %>%
  ggplot(aes(x = Time, y = Weight, fill = Group)) +
  geom_boxplot(alpha = 0.3) +
  scale_y_continuous(name = "Weight (grams)") +
  theme_bw()

Here we can observe that all three groups seem to have outliers. But let us consider this more carefully once we have performed a summary measure analysis.

Since we want to find out whether the rate of change in weight is the same for every diet, we choose the regression coefficient as our summary measure. We first fit a linear regression of weight on time for each rat and extract its slope; the resulting coefficients are then summarized in a box plot.

rats_RegS  <- rats %>% 
  group_by(Group, ID) %>% 
  do(regTimeCoeff = tidy(lm(Weight ~ Time, data = .))$estimate[2]) %>%
  mutate(regTimeCoeff = unlist(regTimeCoeff))

glimpse(rats_RegS)
## Rows: 16
## Columns: 3
## Rowwise: 
## $ Group        <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID           <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ regTimeCoeff <dbl> 0.4844071, 0.3303560, 0.3980345, 0.3295261, 0.4058091, 0.~
rats_RegS %>%
  ggplot(aes(x = Group, y = regTimeCoeff)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=2, fill = "red") +
  scale_y_continuous(name = "Time estimate coefficient in Lm(weight ~ time)") +
  theme_bw()

We can conclude:

  • The variability in the subjects of group 2 is the highest.
  • Group 1 seems to have the lowest weight change and group 2 the highest.
  • Group 1 seems to have one outlier.

To avoid bias, let us try this again, removing the outliers and storing their IDs in a variable.

rats_RegS1 <- rats_RegS %>%
  filter(regTimeCoeff>0.25)

outlierIDs <- which(rats_RegS$regTimeCoeff < 0.25) %>%
  rats_RegS$ID[.]

rats_RegS1 %>%
  ggplot(aes(x = Group, y = regTimeCoeff)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=2, fill = "red") +
  scale_y_continuous(name = "Time estimate coefficient in Lm(weight ~ time)") +
  theme_bw()

With the outlier removed, it is harder to tell whether there is a significant difference between groups 1 and 3.

We still need formal confirmation of whether the coefficient differs between group 1 and groups 2 or 3. We can use a t-test between the different groups, but we then risk a multiple-testing error. Let us just look at the values regardless.

groups <- c(3,2,1)
stats <- sapply(groups, function(g){
  test <- t.test(regTimeCoeff ~ Group, data = filter(rats_RegS, Group != g), var.equal = TRUE)
  return(c(test$p.val, unlist(test$conf.int))) 
})
colnames(stats) <- c("G 1 and 2:", "G 1 and 3:", "G 2 and 3:")
rownames(stats) <- c("pval", "conf_int_low", "conf_int_high")
t(stats)
##                   pval conf_int_low conf_int_high
## G 1 and 2: 0.002236939   -0.9372091   -0.27446917
## G 1 and 3: 0.022063830   -0.5439406   -0.05273443
## G 2 and 3: 0.282396430   -0.3297647    0.94476800

We can conclude:

  • The lowest p-val is between group 1 and 2, which are the most separated in our box plot.
  • Group 1 and 3 are still statistically significantly different (p-val<0.05), but not as much.
  • Group 2 and 3 are not statistically different.
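One hedge against the multiple-testing risk mentioned above is to adjust the three pairwise p-values, e.g. with a Bonferroni correction (a sketch, reusing the `stats` matrix built above):

```r
# Bonferroni-adjusted p-values for the three pairwise t-tests.
p.adjust(stats["pval", ], method = "bonferroni")
```

Since the adjustment here simply multiplies each p-value by 3 (capped at 1), the group 1 vs 3 comparison (p ≈ 0.022) would land around 0.066 and no longer clear the 0.05 threshold, while group 1 vs 2 would remain significant.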

Note that multiple testing can introduce errors, as mentioned in Chapter 2. Therefore, the linear regression model provides a more reliable description of the relationships. Let us fit a model and look at the results with the summary() and anova() functions.

fit <- lm(regTimeCoeff ~ Group, data = rats_RegS)

summary(fit)
## 
## Call:
## lm(formula = regTimeCoeff ~ Group, data = rats_RegS)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.60295 -0.07034  0.04179  0.13911  0.37562 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.35964    0.09114   3.946  0.00167 **
## Group2       0.60584    0.15786   3.838  0.00205 **
## Group3       0.29834    0.15786   1.890  0.08127 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2578 on 13 degrees of freedom
## Multiple R-squared:  0.5382, Adjusted R-squared:  0.4671 
## F-statistic: 7.574 on 2 and 13 DF,  p-value: 0.006594
anova(fit)
## Analysis of Variance Table
## 
## Response: regTimeCoeff
##           Df  Sum Sq Mean Sq F value   Pr(>F)   
## Group      2 1.00665 0.50332  7.5743 0.006594 **
## Residuals 13 0.86387 0.06645                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With this initial exploratory analysis of the data we can conclude:

  • There is a statistically significant difference between the change in weight in groups 1 and 2.
  • The difference between groups 1 and 3 is significant in the pairwise t-test (p ≈ 0.022) but only borderline in the regression model (p ≈ 0.081).
  • The difference between the change in weight in groups 2 and 3 is not statistically significant.

BPRS data

The following data is taken from Davis (2002). In the data, 40 male subjects were rated on the brief psychiatric rating scale or BPRS after being treated with one of two treatments. A measurement at time point 0 was made and then measurements were taken every week for 8 weeks. For more information on the BPRS go to this link.

First, we read the data.

bprs <- read.csv("bprs_long.csv") %>%
  mutate(across(c(treatment, subject), factor))

And observe its structure and summarized values.

str(bprs)
## 'data.frame':    360 obs. of  4 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ week     : int  0 1 2 3 4 5 6 7 8 0 ...
##  $ BPRS     : int  42 36 36 43 41 40 38 47 51 58 ...
summary(bprs)
##  treatment    subject         week        BPRS      
##  1:180     1      : 18   Min.   :0   Min.   :18.00  
##  2:180     2      : 18   1st Qu.:2   1st Qu.:27.00  
##            3      : 18   Median :4   Median :35.00  
##            4      : 18   Mean   :4   Mean   :37.66  
##            5      : 18   3rd Qu.:6   3rd Qu.:43.00  
##            6      : 18   Max.   :8   Max.   :95.00  
##            (Other):252

We can see that we have the long version of the previously wrangled data. The subject factor has 20 levels, but since the same subject codes are reused in both treatments, a total of 40 subjects were scored on the BPRS 9 times each. We can also see that both treatments have the same number of patients. Time is expressed in weeks. Since these measurements were repeated on the same subjects over an extended period of time, we can label this as longitudinal data.
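
As a quick sanity check (a sketch using dplyr, which comes with the tidyverse loaded earlier), we can count the observations behind these numbers:

```r
# Each treatment has 180 rows: 20 subject codes x 9 weekly measurements
bprs %>% count(treatment)

# Each subject code appears 18 times (9 weeks in each of the 2 treatments),
# so the 20 factor levels actually cover 40 distinct individuals
bprs %>% count(subject) %>% head()
```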

A great way to start looking at longitudinal data is to plot it directly in a line plot of the response, in this case BPRS vs week. We will also differentiate the plotted lines belonging to each of the treatments.

bprs %>%
  ggplot(aes(x = week, y = BPRS)) +
  geom_line(aes(linetype = subject)) +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_wrap("treatment", labeller = label_both) +
  theme_bw() +
  theme(legend.position = "none")

At this point, we can notice that all of the subjects tend to get lower BPRS scores over time and to become less variable. Nevertheless, it is hard to conclude anything from such complex data.
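
One optional way to tame this complexity (a sketch, not part of the analysis above) is to collapse the individual profiles into a mean profile per treatment, with standard error bars:

```r
# Mean BPRS per treatment and week, with standard errors of the mean
bprs %>%
  group_by(treatment, week) %>%
  summarise(mean_bprs = mean(BPRS),
            se = sd(BPRS) / sqrt(n()),
            .groups = "drop") %>%
  ggplot(aes(x = week, y = mean_bprs, linetype = treatment)) +
  geom_line() +
  geom_errorbar(aes(ymin = mean_bprs - se, ymax = mean_bprs + se),
                width = 0.2) +
  theme_bw()
```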

If we look at the data set in its long form, it is easy to see that it can be studied with a multiple linear regression model. Here we ignore the fact that the readings come from the same subjects. Let us fit a linear regression model.

bprs_reg <- lm(BPRS ~ week + treatment, data = bprs)
summary(bprs_reg)
## 
## Call:
## lm(formula = BPRS ~ week + treatment, data = bprs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

Looking at the p-value above 0.05, we can see that the second treatment is not statistically different from the first one. Additionally, the model only explains 18.5 % of the variation in the BPRS measurements. More importantly, this model assumes that the measurements are independent, which is incorrect for repeated measures. To overcome this issue, we will use linear mixed models.

First, we can fit the data to a random intercept model with week and treatment as the variables affecting BPRS. This model adds a random effect for the unobserved subject-level variables by allowing each subject’s intercept to be different.

library(lme4)
bprs_ref <- lmer(BPRS ~ week + treatment + (1 | subject), data = bprs, REML = F)
summary(bprs_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: BPRS ~ week + treatment + (1 | subject)
##    Data: bprs
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

We can see that the random-effect variance of the subjects is low compared to the residual variance, meaning that there is not much variation between the fitted intercepts of the individual subjects. Furthermore, the fixed-effect estimates are identical to those of the regular linear model, and both the intercept and week estimates are significant, while treatment 2 is not. This supports the conclusion that the two treatments do not differ in their effect on the BPRS.
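
One way to quantify this (a hypothetical addition, plugging in the variance components printed in the summary above) is the intraclass correlation, i.e. the share of total variance attributable to between-subject differences:

```r
# Variance components from summary(bprs_ref)
var_subject  <- 47.41   # between-subject (random intercept) variance
var_residual <- 104.21  # residual variance

# Intraclass correlation: proportion of variance due to subjects
icc <- var_subject / (var_subject + var_residual)
round(icc, 2)  # about 0.31
```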

Now it is time to calculate the random intercept and random slope model. This model allows both the slope and the intercept to vary between subjects. This change allows us to take into account not only each subject’s starting point but also their individual rate of change in BPRS over time. So let us calculate it.

bprs_ref1 <- lmer(BPRS ~ week + treatment + (week | subject), data = bprs, REML = F)
summary(bprs_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: BPRS ~ week + treatment + (week | subject)
##    Data: bprs
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000

The fixed-effect estimates are again identical to those of the regular linear model and the random intercept model, and both the intercept and week estimates are significant, while treatment 2 is not. This again supports the conclusion that the two treatments do not differ in their effect on the BPRS.

We can use anova() to compare our two random-effects models.

anova(bprs_ref1,bprs_ref)
## Data: bprs
## Models:
## bprs_ref: BPRS ~ week + treatment + (1 | subject)
## bprs_ref1: BPRS ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## bprs_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## bprs_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value of the chi-squared test in the ANOVA is significant, which indicates that the random intercept and slope model fits our data better.

Lastly, let us run our random intercept and slope model again, now including a week × treatment interaction, to see whether it affects the BPRS. We immediately compare it with the previous model.

bprs_ref2 <- lmer(BPRS ~ week * treatment + (week | subject), data = bprs, REML = F)
summary(bprs_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: BPRS ~ week * treatment + (week | subject)
##    Data: bprs
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0513 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 65.0048  8.0626        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4702  9.8219        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2522  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
anova(bprs_ref2, bprs_ref1)
## Data: bprs
## Models:
## bprs_ref1: BPRS ~ week + treatment + (week | subject)
## bprs_ref2: BPRS ~ week * treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## bprs_ref1    7 2745.4 2772.6 -1365.7   2731.4                       
## bprs_ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

At this point, the chi-squared p-value is not significant. This shows that the interaction model is not better, further supporting that the rates of change in BPRS do not differ significantly between the treatments.

Now, let us compare the original data with the values fitted by the random intercept and random slope model.

g1 <- bprs %>%
  ggplot(aes(x = week, y = BPRS)) +
  geom_line(aes(linetype = subject)) +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_wrap("treatment", labeller = label_both) +
  theme_bw() +
  theme(legend.position = "none")

Fitted <- fitted(bprs_ref1)

bprs <- bprs %>%
  mutate(Fitted)

g2 <- bprs %>%
  ggplot(aes(x = week, y = Fitted)) +
  geom_line(aes(linetype = subject)) +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_wrap("treatment", labeller = label_both) +
  theme_bw() +
  theme(legend.position = "none")

library(ggpubr)
ggarrange(g1, g2, ncol = 1)

The plots show very clearly that there is no discernible pattern differentiating one treatment from the other.
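
If we wanted a numeric counterpart to this visual comparison (an optional sketch; Fitted is the column added to bprs above), we could measure how closely the fitted values track the observed ones:

```r
# Agreement between observed BPRS and the values fitted by bprs_ref1
cor(bprs$BPRS, bprs$Fitted)              # linear agreement
sqrt(mean((bprs$BPRS - bprs$Fitted)^2))  # root mean squared error
```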